Semantic analysis of offensive language categories from existing annotated corpora

Authors

Abstract

There exists a vast number of offensive language corpora for the English language, with differing annotation criteria and category naming. In this paper, we explore 21 offensive language categories. We use natural language processing techniques to find correlations between the categories based on seven data sets. We employ several traditional (TF–IDF) and advanced (fastText, GloVe, Word2Vec, BERT, and other deep NLP methods) techniques to uncover similarities among the categories. The findings reveal that most categories are densely interconnected, while a two-level hierarchical representation of them can be provided. We also transfer the analysis to the Slovenian language and compare both researched languages.
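As a rough illustration of the traditional (TF–IDF) end of this comparison, the sketch below builds one TF–IDF vector per category and compares categories by cosine similarity. It is not the authors' pipeline: the category names, the toy texts, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy data: each category name maps to the concatenated
# text of the examples annotated with that category.
category_texts = {
    "insult": "you are a stupid idiot",
    "abuse": "you stupid worthless idiot",
    "threat": "i will find you and hurt you",
}


def tfidf_vectors(docs):
    """Return {name: {term: tf-idf weight}} for a dict of name -> text."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many categories each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Return {(category_a, category_b): cosine similarity} for all pairs."""
    vectors = tfidf_vectors(docs)
    names = sorted(vectors)
    return {(a, b): cosine(vectors[a], vectors[b]) for a in names for b in names}


similarities = similarity_matrix(category_texts)
```

On this toy input, "insult" and "abuse" share more vocabulary than "insult" and "threat", so their similarity is higher. A matrix of this kind could then be fed to a clustering step to look for a hierarchy among categories, although which clustering method the paper uses is not stated in the abstract.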

Similar resources

Extracting a bilingual semantic grammar from FrameNet-annotated corpora

We present the creation of an English-Swedish FrameNet-based grammar in Grammatical Framework. The aim of this research is to make existing framenets computationally accessible for multilingual natural language applications via a common semantic grammar API, and to facilitate the porting of such grammar to other languages. In this paper, we describe the abstract syntax of the semantic grammar w...

Modifying Existing Annotated Corpora for General Comparative Evaluation of Parsing

We argue that the current dominant paradigm in parser evaluation work, which combines use of the Penn Treebank reference corpus and of the Parseval scoring metrics, is not well-suited to the task of general comparative evaluation of diverse parsing systems. In (Gaizauskas et al., 1998), we propose an alternative approach which has two key components. Firstly, we propose parsed corpora for testi...

A Description Language for Syntactically Annotated Corpora

This paper introduces a description language for syntactically annotated corpora which allows for encoding both the syntactic annotation to a corpus and the queries to a syntactically annotated corpus. In terms of descriptive adequacy and computational efficiency, the description language is a compromise between script-like corpus query languages and high-level, typed unification-based grammar ...

Training Dependency Parsers from Partially Annotated Corpora

We introduce a maximum spanning tree (MST) dependency parser that can be trained from partially annotated corpora, allowing for effective use of available linguistic resources and reduction of the costs of preparing new training data. This is especially important for domain adaptation in a real-world situation. We use a pointwise approach where each edge in the dependency tree for a sentence is...

Extracting Collocations from syntactically annotated biomedical Corpora

This thesis investigates the extraction of frequently used phrases (so-called collocations) from biomedical text sources. The extraction of uninterrupted collocation candidates is introduced. For interrupted candidates, with gaps between their subcomponents, a new technique using suffix tries is developed. It is based on the iterative extension of frequent smaller patterns. This reduces computa...

Journal

Journal title: Uporabna informatika

Year: 2022

ISSN: 1318-1882, 2630-435X

DOI: https://doi.org/10.31449/upinf.151